-
With the growing adoption of privacy-preserving machine learning algorithms such as Differentially Private Stochastic Gradient Descent (DP-SGD), training or fine-tuning models on private datasets has become increasingly prevalent. This shift has created a need for models offering varying privacy guarantees and utility levels to satisfy diverse user requirements. Managing numerous versions of large models introduces significant operational challenges, including increased inference latency, higher resource consumption, and elevated costs. Model deduplication is a technique widely used by model serving and database systems to support high-performance, low-cost inference queries and model diagnosis queries. However, no existing model deduplication work considers privacy, leading to unbounded accumulation of privacy costs for certain deduplicated models and to inefficiencies when deduplicating DP-trained models. We formalize the problem of deduplicating DP-trained models for the first time and propose a novel privacy- and accuracy-aware deduplication mechanism to address it. We develop a greedy strategy to select and assign base models to target models to minimize storage and privacy costs. When deduplicating a target model, we dynamically schedule accuracy validations and apply the Sparse Vector Technique to reduce the privacy costs associated with private validation data. Compared to baselines, our approach improved the compression ratio by up to 35× for individual models (including large language models and vision transformers). We also observed up to 43× inference speedup due to the reduction of I/O operations.
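The Sparse Vector Technique (SVT) mentioned above pays privacy cost only for queries that land above a noisy threshold, which is why it suits repeated accuracy validations. A minimal sketch of one standard SVT variant follows; the function name, parameterization, and noise scales are our own illustration, not the paper's actual validation schedule:

```python
import numpy as np

def sparse_vector(queries, threshold, epsilon, max_positives):
    """Sparse Vector Technique (AboveThreshold, Laplace variant).

    Answers a stream of sensitivity-1 queries, reporting only whether
    each exceeds `threshold`. The privacy cost scales with
    `max_positives` (the number of above-threshold answers allowed),
    not with the total number of queries.
    """
    eps1, eps2 = epsilon / 2.0, epsilon / 2.0
    noisy_threshold = threshold + np.random.laplace(scale=1.0 / eps1)
    answers, positives = [], 0
    for q in queries:
        noisy_q = q + np.random.laplace(scale=2.0 * max_positives / eps2)
        if noisy_q >= noisy_threshold:
            answers.append(True)
            positives += 1
            if positives >= max_positives:
                break  # budget for above-threshold answers exhausted
            # resample the threshold noise after each positive answer
            noisy_threshold = threshold + np.random.laplace(scale=1.0 / eps1)
        else:
            answers.append(False)
    return answers
```

In the deduplication setting, each query could be an accuracy check of a candidate deduplicated model on private validation data, so only the few checks that trip the threshold consume privacy budget.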
-
Decision forests, including Random Forest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks have been developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forests from Google, Hummingbird from Microsoft, NVIDIA FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that for in-database inference, in addition to the popular UDF-centric representation, which encapsulates the ML model in a single user-defined function (UDF), there also exists a relation-centric representation that breaks down decision forest inference into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark comparing these two implementations to the aforementioned decoupled inference pipelines and to existing in-database inference pipelines such as Spark-SQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
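To picture the relation-centric representation, here is a toy sketch (the schema, column names, and pandas stand-in for SQL operators are ours, not the paper's): a flattened tree is stored as a relation of nodes, and inference advances all samples one tree level per join instead of walking each sample through the tree in a per-row UDF.

```python
import pandas as pd

# Hypothetical flattened decision tree: one row per node; feature is None at leaves.
nodes = pd.DataFrame({
    "node_id":    [0, 1, 2],
    "feature":    ["x0", None, None],
    "threshold":  [0.5, float("nan"), float("nan")],
    "left":       [1, -1, -1],
    "right":      [2, -1, -1],
    "leaf_value": [float("nan"), 0.0, 1.0],
})

# Features kept in "long" relational form: (sample_id, feature, value).
samples = pd.DataFrame({
    "sample_id": [0, 0, 1, 1],
    "feature":   ["x0", "x1", "x0", "x1"],
    "value":     [0.2, 3.0, 0.9, 1.0],
})

# Relation-centric traversal: one join per tree level moves *all* samples
# down one step, mirroring fine-grained SQL joins over the node relation.
state = samples[["sample_id"]].drop_duplicates().assign(node_id=0)
done = []
while len(state):
    state = state.merge(nodes, on="node_id")
    leaves = state["feature"].isna()
    done.append(state.loc[leaves, ["sample_id", "leaf_value"]])
    inner = state.loc[~leaves].merge(samples, on=["sample_id", "feature"])
    inner["node_id"] = inner["left"].where(inner["value"] < inner["threshold"],
                                           inner["right"])
    state = inner[["sample_id", "node_id"]]

predictions = pd.concat(done)  # for a forest: average leaf_value per sample_id
print(predictions)
```

Because each step is a bulk join and filter, the database optimizer can parallelize and pipeline it, which is one plausible reason the relation-centric form wins on large models.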
-
Existing approaches to automatic data transformation are insufficient to meet the requirements of many real-world scenarios, such as the building sector. First, there is no convenient interface for domain experts to provide domain knowledge easily. Second, they incur significant training data collection overheads. Third, their accuracy suffers under complicated schema changes. To address these shortcomings, we present a novel approach that leverages the unique capabilities of large language models (LLMs) in coding, complex reasoning, and zero-shot learning to generate SQL code that transforms source datasets into target datasets. We demonstrate the viability of this approach by designing an LLM-based framework, termed SQLMorpher, which comprises a prompt generator that integrates the initial prompt with optional domain knowledge and historical patterns drawn from external databases. It also implements an iterative prompt optimization mechanism that automatically improves the prompt based on flaw detection. The key contributions of this work include (1) pioneering an end-to-end LLM-based solution for data transformation, (2) developing a benchmark dataset of 105 real-world building energy data transformation problems, and (3) conducting an extensive empirical evaluation in which our approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates the effectiveness of utilizing LLMs in complex, domain-specific challenges, highlighting their potential to drive sustainable solutions.
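The iterative prompt-optimization loop described above can be pictured as: generate SQL, execute it, and feed any detected flaw back into the next prompt. A minimal sketch under our own assumptions (all names are hypothetical; `call_llm` stands in for any chat-completion client and `target_check` for the paper's flaw detection):

```python
import sqlite3

def transform_with_retries(source_db, prompt, target_check, call_llm, max_iters=5):
    """Sketch of an LLM-driven data-transformation loop (names are ours,
    not SQLMorpher's): generate SQL, run it, and on failure append the
    flaw description to the prompt and try again."""
    conn = sqlite3.connect(source_db)
    for _ in range(max_iters):
        sql = call_llm(prompt)        # LLM proposes a candidate transformation
        try:
            result = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:  # execution error: feed it back verbatim
            prompt += f"\n-- Previous SQL failed with: {exc}\nPlease fix it."
            continue
        flaw = target_check(result)   # e.g. compare against the target schema
        if flaw is None:
            return sql, result
        prompt += f"\n-- Previous SQL produced wrong output: {flaw}\nPlease fix it."
    raise RuntimeError("no valid transformation found within the iteration budget")
```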
-
Subseasonal-to-seasonal (S2S) prediction, especially the prediction of extreme hydroclimate events such as droughts and floods, is not only scientifically challenging but also has substantial societal impacts. Motivated by preliminary studies, the Global Energy and Water Exchanges (GEWEX)/Global Atmospheric System Study (GASS) has launched a new initiative called "Impact of Initialized Land Surface Temperature and Snowpack on Subseasonal to Seasonal Prediction" (LS4P) as the first international grass-roots effort to introduce spring land surface temperature (LST)/subsurface temperature (SUBT) anomalies over high mountain areas as a crucial factor that can lead to significant improvement in precipitation prediction through the remote effects of land–atmosphere interactions. LS4P focuses on process understanding and predictability, and hence is different from, and complements, other international projects that focus on operational S2S prediction. More than 40 groups worldwide have participated in this effort, including 21 Earth system models, 9 regional climate models, and 7 data groups. This paper provides an overview of the history and objectives of LS4P, presents the first-phase experimental protocol (LS4P-I), which focuses on the remote effect of the Tibetan Plateau, discusses the LST/SUBT initialization, and presents the preliminary results. Multi-model ensemble experiments and analyses of observational data have revealed that the hydroclimatic effect of the spring LST on the Tibetan Plateau is not limited to the Yangtze River basin but may have a significant large-scale impact on summer precipitation beyond East Asia and its S2S prediction. Preliminary studies and analysis have also shown that LS4P models are unable to preserve the initialized LST anomalies in producing the observed anomalies, largely for two main reasons: (i) inadequacies in the land models, arising from total soil depths that are too shallow and the use of simplified parameterizations, both of which tend to limit the soil memory; (ii) reanalysis data, which are used for initial conditions, have large discrepancies from the observed mean state and anomalies of LST over the Tibetan Plateau. Innovative approaches have been developed to largely overcome these problems.